Abstract

Recently, metric learning and similarity learning have attracted a large amount of interest.
Many models and optimisation algorithms have been proposed. However, there is relatively
little work on the generalization analysis of such methods. In this paper, we derive novel
generalization bounds of metric and similarity learning. In particular, we first show that the
generalization analysis reduces to the estimation of the Rademacher average over
"sums-of-i.i.d." sample-blocks related to the specific matrix norm. Then, we derive generalization
bounds for metric/similarity learning with different matrix-norm regularisers by estimating
their specific Rademacher complexities. Our analysis indicates that sparse metric/similarity
learning with $L^1$-norm regularisation could lead to significantly better bounds than those
with Frobenius-norm regularisation. Our novel generalization analysis develops and refines
the techniques of U-statistics and Rademacher complexity analysis.

Keywords: Metric learning, Similarity learning, Generalization bound, Rademacher complexity,
U-statistics

1. Introduction

The success of many machine learning algorithms (e.g. nearest-neighbor classification
and k-means clustering) depends on the concepts of distance metric and similarity.
For instance, the k-nearest-neighbor (kNN) classifier depends on a distance function to identify
the nearest neighbors for classification; k-means algorithms depend on the pairwise distance
measurements between examples for clustering. Kernel methods and information retrieval
methods rely on a similarity measure between samples. Many existing studies have been
devoted to learning a metric or similarity automatically from data, which is usually referred
to as metric learning and similarity learning, respectively.

Most work in metric learning focuses on learning a (squared) Mahalanobis distance defined,
for any $x, t \in \mathbb{R}^d$, by $d_M(x,t) = (x-t)^\top M (x-t)$, where $M$ is a positive
semi-definite matrix; see e.g. Bar-Hillel et al. (2005); Davis et al. (2007); Globerson and Roweis (2005);
Goldberger et al. (2004); Shen et al. (2009); Weinberger and Saul (2008); Xing et al. (2002);
Ying et al. (2009); Yang and Jin (2007). Concurrently, the pairwise similarity defined by
$s_M(x,t) = x^\top M t$ was studied in Chechik et al. (2010); Shalit et al. (2010); Kar and Jain
(2011); Maurer (2008). These methods have been successfully applied to various real-world
problems including information retrieval and face verification (Chechik et al., 2010;
Guillaumin et al., 2009; Hoi et al., 2006; Ying and Li, 2012). Although there are a large
number of studies devoted to supervised metric/similarity learning based on different
objective functions, few studies address the generalization analysis of such methods. The recent
work (Jin et al., 2009) pioneered the generalization analysis for metric learning using the
concept of uniform stability (Bousquet and Elisseeff, 2002). However, this approach only
works for strongly convex norms, e.g. the Frobenius norm, and the offset term is fixed,
which makes the generalization analysis essentially different.

In this paper, we develop a novel approach for generalization analysis of metric learning
and similarity learning which can deal with general matrix regularisation terms including
the Frobenius norm (Jin et al., 2009), sparse $L^1$-norm (Rosales and Fung, 2006), mixed
$(2,1)$-norm (Ying et al., 2009) and trace-norm (Ying et al., 2009; Shen et al., 2009). In particular,
we first show that the generalization analysis for metric/similarity learning reduces to the
estimation of the Rademacher average over "sums-of-i.i.d." sample-blocks related to the
specific matrix norm, which we refer to as the Rademacher complexity for metric (similarity)
learning. Then, we show how to estimate the Rademacher complexities with different matrix
regularisers. Our analysis indicates that sparse metric/similarity learning with $L^1$-norm
regularisation could lead to significantly better generalization bounds than those with
Frobenius-norm regularisation, especially when the dimension of the input data is high. This is
consistent with the rationale that sparse methods are more effective for high-dimensional
data analysis. Our novel generalization analysis develops and extends Rademacher
complexity analysis (Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002) to the
setting of metric/similarity learning by using techniques of U-statistics (Clémençon et al.,
2008; Peña and Giné, 1999).

The paper is organized as follows. The next section reviews the models of metric/similarity
learning. Section 3 establishes the main theorems. In Section 4, we derive and discuss
generalization bounds for metric/similarity learning with various matrix-norm regularisation
terms. Section 5 concludes the paper.

Notation: Let $\mathbb{N}_n = \{1, 2, \ldots, n\}$ for any $n \in \mathbb{N}$. For any
$X, Y \in \mathbb{R}^{d \times n}$, $\langle X, Y \rangle = \mathrm{Tr}(X^\top Y)$,
where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. The space of symmetric $d \times d$ matrices will
be denoted by $\mathbb{S}^d$. We equip $\mathbb{S}^d$ with a general matrix norm $\|\cdot\|$; it can be a Frobenius
norm, trace-norm or mixed norm. Its associated dual norm is denoted, for any $M \in \mathbb{S}^d$,
by $\|M\|_* = \sup\{\langle X, M \rangle : X \in \mathbb{S}^d, \|X\| \le 1\}$. The Frobenius norm on matrices or vectors
is always denoted by $\|\cdot\|_F$. The cone of positive semi-definite matrices is denoted by $\mathbb{S}^d_+$.
Later on we use the conventional notation that $X_{ij} = (x_i - x_j)(x_i - x_j)^\top$ and $\widetilde{X}_{ij} = x_i x_j^\top$.

2. Metric/Similarity Learning Formulation 

In our learning setting, we have an input space $\mathcal{X} \subseteq \mathbb{R}^d$ and an output (label) space $\mathcal{Y}$. Denote
$Z = \mathcal{X} \times \mathcal{Y}$ and suppose $\mathbf{z} := \{z_i = (x_i, y_i) \in Z : i \in \mathbb{N}_n\}$ is an i.i.d. training set drawn according to
an unknown distribution $\rho$ on $Z$. Denote the $d \times n$ input data matrix by $X = (x_i : i \in \mathbb{N}_n)$
and the $d \times d$ distance matrix by $M = (M_{\ell k})_{\ell, k \in \mathbb{N}_d}$. Then, the (pseudo-) distance between
$x_i$ and $x_j$ is measured by

$$ d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j). $$

The goal of metric learning is to identify a distance function $d_M(x_i, x_j)$ such that it yields a
small value for a similar pair and a large value for a dissimilar pair. The bilinear similarity
function is defined by

$$ s_M(x_i, x_j) = x_i^\top M x_j. $$

Similarly, the target of similarity learning is to learn $M \in \mathbb{S}^d$ such that it reports a large
similarity value for a similar pair and a small similarity value for a dissimilar pair. It is
worth pointing out that we do not require the positive semi-definiteness of the matrix $M$
throughout this paper. However, we do assume $M$ to be symmetric, since this guarantees that
the distance (similarity) between $x_i$ and $x_j$, i.e. $d_M(x_i, x_j)$, is equal to that between $x_j$
and $x_i$, i.e. $d_M(x_j, x_i)$.
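As an illustration of the two definitions above, the following minimal sketch evaluates the pseudo-distance and the bilinear similarity with NumPy. The function names `mahalanobis_sq` and `bilinear_similarity` and the random test data are hypothetical choices made for this example only; they are not part of the original formulation.

```python
import numpy as np

def mahalanobis_sq(x, t, M):
    """Squared (pseudo-)distance d_M(x, t) = (x - t)^T M (x - t)."""
    diff = x - t
    return float(diff @ M @ diff)

def bilinear_similarity(x, t, M):
    """Bilinear similarity s_M(x, t) = x^T M t."""
    return float(x @ M @ t)

# Illustrative data (assumption): a random symmetric M, not necessarily PSD.
rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
M = (A + A.T) / 2
x, t = rng.standard_normal(d), rng.standard_normal(d)

# d_M can also be written as the trace inner product <X, M>
# with X = (x - t)(x - t)^T, which is the form used in the analysis.
X_pair = np.outer(x - t, x - t)
d_via_trace = float(np.trace(X_pair.T @ M))
```

Because $M$ is symmetric, swapping the two arguments leaves both quantities unchanged, which is the property the symmetry assumption is meant to guarantee.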

There are two main terms in the metric/similarity learning model: the empirical error and
the matrix regularisation term. The empirical error term exploits the similarity and
dissimilarity information provided by the labels, while an appropriate matrix
regularisation term avoids overfitting and improves generalization performance.

For any pair of samples $(x_i, x_j)$, let $r(y_i, y_j) = 1$ if $y_i = y_j$ and $r(y_i, y_j) = -1$ otherwise. It
is expected that there exists an offset term $b \in \mathbb{R}$ such that $d_M(x_i, x_j) \le b$ for $r(y_i, y_j) = 1$
and $d_M(x_i, x_j) > b$ otherwise. This naturally leads to the empirical error (Jin et al., 2009)
defined by

$$ \mathcal{E}^{I}_{\mathbf{z}}(M, b) := \frac{1}{n(n-1)} \sum_{i,j \in \mathbb{N}_n,\, i \ne j} I\big[ r(y_i, y_j)\,(d_M(x_i, x_j) - b) > 0 \big], $$

where the indicator function $I[x]$ equals 1 if $x$ is true and zero otherwise.

Due to the indicator function, the above empirical error is non-differentiable and
non-convex, which makes it difficult to optimise. A usual way to overcome this shortcoming is to
upper-bound it with a convex loss function. For instance, we can use the
hinge loss to upper-bound the indicator function, which leads to the following empirical
error:

$$ \mathcal{E}_{\mathbf{z}}(M, b) := \frac{1}{n(n-1)} \sum_{i,j \in \mathbb{N}_n,\, i \ne j} \big[ 1 + r(y_i, y_j)\,(d_M(x_i, x_j) - b) \big]_+ . \tag{1} $$
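The following sketch shows one way to evaluate the hinge-loss empirical error (1), together with the indicator-based error it upper-bounds. It assumes the normalisation $1/(n(n-1))$ over ordered pairs $i \ne j$ and stores one example per row (the paper stores examples as columns); the function names and the random data are hypothetical.

```python
import numpy as np

def pairwise_terms(X, y, M, b):
    """Return r_ij * (d_M(x_i, x_j) - b) for all pairs and a mask of pairs i != j."""
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]                 # (n, n, d) differences x_i - x_j
    dists = np.einsum("ijk,kl,ijl->ij", diffs, M, diffs)  # d_M(x_i, x_j)
    r = np.where(y[:, None] == y[None, :], 1.0, -1.0)     # +1 same label, -1 otherwise
    return r * (dists - b), ~np.eye(n, dtype=bool)

def empirical_hinge_error(X, y, M, b):
    """Hinge-loss empirical error: mean of [1 + r_ij (d_M - b)]_+ over pairs i != j."""
    terms, off_diag = pairwise_terms(X, y, M, b)
    return float(np.maximum(0.0, 1.0 + terms)[off_diag].mean())

def empirical_01_error(X, y, M, b):
    """Indicator-based error: fraction of pairs i != j with r_ij (d_M - b) > 0."""
    terms, off_diag = pairwise_terms(X, y, M, b)
    return float((terms > 0)[off_diag].mean())

# Illustrative data (assumption): random inputs, binary labels, identity metric.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((30, 5))
y_demo = rng.integers(0, 2, size=30)
M_demo = np.eye(5)
hinge = empirical_hinge_error(X_demo, y_demo, M_demo, b=2.0)
zero_one = empirical_01_error(X_demo, y_demo, M_demo, b=2.0)
```

With $M = 0$ and $b = 0$ every pair contributes exactly 1 to the hinge error, so the error equals 1; this is the value that the trivial solution attains in the objective.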

In order to avoid overfitting, we need to enforce a regularisation term denoted by $\|M\|$,
which will restrict the complexity of the distance matrix. We emphasize here that $\|\cdot\|$ denotes
a general matrix norm in the linear space $\mathbb{S}^d$. Putting the regularisation term and the
empirical error term together yields the following metric learning model:

$$ (M_{\mathbf{z}}, b_{\mathbf{z}}) = \arg\min_{M \in \mathbb{S}^d,\, b \in \mathbb{R}} \big\{ \mathcal{E}_{\mathbf{z}}(M, b) + \lambda \|M\|^2 \big\}, \tag{2} $$

where $\lambda > 0$ is a trade-off parameter.

Different regularisation terms lead to different metric learning formulations. For instance,
the Frobenius norm $\|M\|_F$ is used in Jin et al. (2009). To favor element-sparsity,
Rosales and Fung (2006) introduced the $L^1$-norm regularisation $\|M\| = \sum_{\ell, k \in \mathbb{N}_d} |M_{\ell k}|$.
Ying et al. (2009) proposed the mixed $(2,1)$-norm $\|M\| = \sum_{\ell \in \mathbb{N}_d} \big( \sum_{k \in \mathbb{N}_d} |M_{\ell k}|^2 \big)^{1/2}$ to
encourage the column-wise sparsity of the distance matrix. The trace-norm regularisation
$\|M\| = \sum_{\ell \in \mathbb{N}_d} \sigma_\ell(M)$ was also considered by Ying et al. (2009); Shen et al. (2009). Here,
$\{\sigma_\ell : \ell \in \mathbb{N}_d\}$ denote the singular values of a matrix $M \in \mathbb{S}^d$. Since $M$ is symmetric, the
singular values of $M$ are identical to the absolute values of its eigenvalues.
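For concreteness, the four regularisers and the dual norms used later in the analysis can be computed as in the sketch below. The helper names are hypothetical, and the test matrices are random symmetric matrices chosen only for illustration; the duality pairing $\langle X, M \rangle \le \|X\|_* \|M\|$ is the standard one for each norm.

```python
import numpy as np

# Regularisation norms on a symmetric d x d matrix M.
def frobenius(M):  return float(np.linalg.norm(M, "fro"))
def l1_norm(M):    return float(np.abs(M).sum())
def mixed_21(M):   return float(np.linalg.norm(M, axis=1).sum())   # sum of row 2-norms
def trace_norm(M): return float(np.linalg.norm(M, "nuc"))          # sum of singular values

# Corresponding dual norms.
def linf_norm(M):  return float(np.abs(M).max())                   # dual of L1
def mixed_2inf(M): return float(np.linalg.norm(M, axis=1).max())   # dual of (2,1)
def spectral(M):   return float(np.linalg.norm(M, 2))              # dual of trace-norm

NORM_PAIRS = [
    (frobenius, frobenius),   # the Frobenius norm is self-dual
    (l1_norm, linf_norm),
    (mixed_21, mixed_2inf),
    (trace_norm, spectral),
]

def random_symmetric(d, rng):
    A = rng.standard_normal((d, d))
    return (A + A.T) / 2

rng = np.random.default_rng(0)
M_demo = random_symmetric(6, rng)
X_demo = random_symmetric(6, rng)
inner = float(np.trace(X_demo.T @ M_demo))   # <X, M> = Tr(X^T M)
```

For a symmetric matrix the row-wise and column-wise versions of the mixed norm coincide, so the choice of `axis=1` here is immaterial.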






Cao Guo Ying 



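In analogy to the formulation of metric learning, we consider the following empirical
error for similarity learning (Maurer, 2008; Chechik et al., 2010):

$$ \widetilde{\mathcal{E}}_{\mathbf{z}}(M, b) := \frac{1}{n(n-1)} \sum_{i,j \in \mathbb{N}_n,\, i \ne j} \big[ 1 - r(y_i, y_j)\,(s_M(x_i, x_j) - b) \big]_+ . \tag{3} $$

This leads to the regularised formulation for similarity learning defined as follows:

$$ (\widetilde{M}_{\mathbf{z}}, \widetilde{b}_{\mathbf{z}}) = \arg\min_{M \in \mathbb{S}^d,\, b \in \mathbb{R}} \big\{ \widetilde{\mathcal{E}}_{\mathbf{z}}(M, b) + \lambda \|M\|^2 \big\}. \tag{4} $$

Maurer (2008) used the Frobenius-norm regularisation for similarity learning. The
trace-norm regularisation has been used by Shalit et al. (2010) to encourage a low-rank similarity
matrix $M$.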

3. Statistical Generalization Analysis 

In this section, we give a detailed proof of generalization bounds for metric and
similarity learning. In particular, we develop a novel line of generalization analysis for metric
and similarity learning with general matrix regularisation terms. The key observation is
that the empirical data term $\mathcal{E}_{\mathbf{z}}(M, b)$ for metric learning is a modification of U-statistics
and it is expected to converge to its expected form defined by

$$ \mathcal{E}(M, b) = \iint \big( 1 + r(y, y')\,(d_M(x, x') - b) \big)_+ \, d\rho(x, y)\, d\rho(x', y'). \tag{5} $$

The empirical term $\widetilde{\mathcal{E}}_{\mathbf{z}}(M, b)$ for similarity learning is expected to converge to

$$ \widetilde{\mathcal{E}}(M, b) = \iint \big( 1 - r(y, y')\,(s_M(x, x') - b) \big)_+ \, d\rho(x, y)\, d\rho(x', y'). \tag{6} $$

The target of generalization analysis is to bound the true error $\mathcal{E}(M_{\mathbf{z}}, b_{\mathbf{z}})$ by the empirical
error $\mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}})$ for metric learning, and $\widetilde{\mathcal{E}}(\widetilde{M}_{\mathbf{z}}, \widetilde{b}_{\mathbf{z}})$ by the empirical error $\widetilde{\mathcal{E}}_{\mathbf{z}}(\widetilde{M}_{\mathbf{z}}, \widetilde{b}_{\mathbf{z}})$ for
similarity learning.

In the sequel, we provide a detailed proof for generalization bounds of metric learning.
Since the proof for similarity learning is exactly the same as that for metric learning, we
only state the results, followed by some brief comments.

3.1. Bounding the Solutions 

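By the definition of $(M_{\mathbf{z}}, b_{\mathbf{z}})$, we know that

$$ \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}) + \lambda \|M_{\mathbf{z}}\|^2 \le \mathcal{E}_{\mathbf{z}}(0, 0) + \lambda \|0\|^2 = 1, $$

which implies that

$$ \|M_{\mathbf{z}}\| \le \frac{1}{\sqrt{\lambda}}. \tag{7} $$

Now we turn our attention to deriving the bound of the offset term $b_{\mathbf{z}}$ by modifying the
techniques in Chen et al. (2004), which were originally developed to estimate the offset term
of the soft-margin SVM.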






Generalization Bounds for Metric/Similarity Learning 



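Lemma 1 For any samples $\mathbf{z}$ and $\lambda > 0$, let $(M_{\mathbf{z}}, b_{\mathbf{z}})$ be a minimizer of problem (2). Then,
it satisfies that

$$ \min_{i \ne j} \big[ d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}} \big] \le 1, \qquad \max_{i \ne j} \big[ d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}} \big] \ge -1. \tag{8} $$

Hence, there holds

$$ |b_{\mathbf{z}}| \le 1 + \Big( \max_{i \ne j} \|X_{ij}\|_* \Big) \|M_{\mathbf{z}}\|. \tag{9} $$

Proof Recall that $X_{ij} = (x_i - x_j)(x_i - x_j)^\top$ and observe, by the definition of the dual
norm $\|\cdot\|_*$, that

$$ d_M(x_i, x_j) = \langle X_{ij}, M \rangle \le \|X_{ij}\|_* \|M\|. $$

Using the above observation, estimation (9) follows directly from inequality (8). Hence, it
remains to prove inequality (8).

To this end, suppose that $r = \min_{i \ne j} [ d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}} ] > 1$. Then, $d_{M_{\mathbf{z}}}(x_i, x_j) - (b_{\mathbf{z}} + r - 1) \ge 1$
for any $i \ne j$, so pairs with $r(y_i, y_j) = -1$ incur zero hinge loss under both offsets $b_{\mathbf{z}}$ and
$b_{\mathbf{z}} + r - 1$. Consequently,

$$ \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}} + r - 1) = \frac{1}{n(n-1)} \sum_{i \ne j,\; r(y_i, y_j) = 1} \big( 1 + d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}} - (r - 1) \big)_+ $$
$$ < \frac{1}{n(n-1)} \sum_{i \ne j,\; r(y_i, y_j) = 1} \big( 1 + d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}} \big)_+ \le \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}). $$

Hence, $\mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}} + r - 1) + \lambda \|M_{\mathbf{z}}\|^2 < \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}) + \lambda \|M_{\mathbf{z}}\|^2$, which contradicts the definition
of the minimizer $(M_{\mathbf{z}}, b_{\mathbf{z}})$. Hence, $r = \min_{i \ne j} [ d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}} ] \le 1$. In analogy to the above
argument, it can be shown that $\max_{i \ne j} [ d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}} ] \ge -1$, which completes the proof of
the lemma. ■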

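Denote

$$ \mathcal{F} = \big\{ (M, b) : \|M\| \le 1/\sqrt{\lambda},\; |b| \le 1 + X_* \|M\| \big\}, \tag{10} $$

where

$$ X_* = \sup_{x, x' \in \mathcal{X}} \big\| (x - x')(x - x')^\top \big\|_* . $$

From the above lemma, for any samples $\mathbf{z}$ we can easily see that the solution $(M_{\mathbf{z}}, b_{\mathbf{z}})$ of
(2) belongs to the bounded set $\mathcal{F} \subseteq \mathbb{S}^d \times \mathbb{R}$.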
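The step from the two one-sided inequalities on $d_{M_{\mathbf{z}}}(x_i, x_j) - b_{\mathbf{z}}$ to the bound on the offset can be spelled out as follows. This is a sketch under the assumption that the lemma states $\min_{i \ne j}[d_{M_{\mathbf{z}}}(x_i,x_j) - b_{\mathbf{z}}] \le 1$ and $\max_{i \ne j}[d_{M_{\mathbf{z}}}(x_i,x_j) - b_{\mathbf{z}}] \ge -1$; the index pairs $(i_0, j_0)$ and $(i_1, j_1)$ are introduced here only to name the pairs attaining the minimum and the maximum.

```latex
% Pair (i_0, j_0) attains the minimum, pair (i_1, j_1) attains the maximum.
\begin{align*}
b_{\mathbf z} &\ge d_{M_{\mathbf z}}(x_{i_0}, x_{j_0}) - 1
   \ge -\|X_{i_0 j_0}\|_* \, \|M_{\mathbf z}\| - 1, \\
b_{\mathbf z} &\le d_{M_{\mathbf z}}(x_{i_1}, x_{j_1}) + 1
   \le \|X_{i_1 j_1}\|_* \, \|M_{\mathbf z}\| + 1, \\
\text{hence}\quad |b_{\mathbf z}| &\le 1 + \Big(\max_{i \ne j} \|X_{ij}\|_*\Big) \|M_{\mathbf z}\|
   \le 1 + X_* \, \|M_{\mathbf z}\|
   \le 1 + \frac{X_*}{\sqrt{\lambda}} .
\end{align*}
```

Both lines use $|d_M(x_i, x_j)| = |\langle X_{ij}, M \rangle| \le \|X_{ij}\|_* \|M\|$, and the last inequality uses $\|M_{\mathbf{z}}\| \le 1/\sqrt{\lambda}$.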

3.2. Generalization Bounds 

Before stating the generalization bounds, we introduce some notation. For any
$z = (x, y), z' = (x', y') \in Z$, let $\Phi_{M,b}(z, z') = (1 + r(y, y')(d_M(x, x') - b))_+$. Hence, for any
$(M, b) \in \mathcal{F}$,

$$ \sup_{z, z'} \sup_{(M,b) \in \mathcal{F}} \Phi_{M,b}(z, z') \le B_\lambda := 2\big(1 + X_*/\sqrt{\lambda}\big). \tag{11} $$

Let $\lfloor n/2 \rfloor$ denote the largest integer not exceeding $n/2$ and recall the definition that
$X_{ij} = (x_i - x_j)(x_i - x_j)^\top$. We now define the Rademacher average over sums-of-i.i.d. sample-blocks related
to the dual matrix norm $\|\cdot\|_*$ by

$$ \widehat{R}_n = \frac{1}{\lfloor n/2 \rfloor}\, \mathbb{E}_\sigma \Big\| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i X_{i(\lfloor n/2 \rfloor + i)} \Big\|_* , \tag{12} $$

and its expectation is denoted by $R_n = \mathbb{E}_{\mathbf{z}}[\widehat{R}_n]$. Our main theorem below shows that the
generalization bounds for metric learning critically depend on the quantity $R_n$. For this
reason, we refer to $R_n$ as the Rademacher complexity for metric learning. It is worth
mentioning that the metric learning formulation (2) depends on the norm $\|\cdot\|$ of the linear
space $\mathbb{S}^d$, and the Rademacher complexity $R_n$ is related to its dual norm $\|\cdot\|_*$.
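The sample-dependent Rademacher average can be approximated numerically by averaging over random sign vectors. The sketch below is a Monte Carlo estimate under the assumption that the average has the form $\frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_\sigma \| \sum_i \sigma_i X_{i(\lfloor n/2 \rfloor + i)} \|_*$ with $X_{ij} = (x_i - x_j)(x_i - x_j)^\top$; the function name, the number of draws and the uniform test data are hypothetical. It estimates the quantity for one fixed sample, not its expectation over samples.

```python
import numpy as np

def empirical_rademacher(X, dual_norm, n_draws=500, seed=0):
    """Monte Carlo estimate of (1/m) E_sigma || sum_i sigma_i X_{i(m+i)} ||_*,
    with m = floor(n/2). X has shape (n, d), one example per row."""
    rng = np.random.default_rng(seed)
    m = X.shape[0] // 2
    diffs = X[:m] - X[m:2 * m]                      # block differences x_i - x_{m+i}
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)     # i.i.d. Rademacher signs
        S = (diffs * sigma[:, None]).T @ diffs      # sum_i sigma_i diff_i diff_i^T
        total += dual_norm(S)
    return total / (n_draws * m)

# Illustrative data (assumption): n = 200 points drawn uniformly from [0, 1]^50.
X_demo = np.random.default_rng(1).uniform(0.0, 1.0, size=(200, 50))
r_fro = empirical_rademacher(X_demo, lambda S: np.linalg.norm(S, "fro"))
r_inf = empirical_rademacher(X_demo, lambda S: np.abs(S).max())
```

Because both calls use the same seed, they see the same sign vectors, so the estimate for the $L^\infty$ dual norm can never exceed the one for the Frobenius norm; this mirrors the comparison between the sparse and Frobenius regularisers discussed in Section 4.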

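Theorem 2 Let $(M_{\mathbf{z}}, b_{\mathbf{z}})$ be the solution of formulation (2). Then, for any $0 < \delta < 1$, with
probability $1 - \delta$ we have that

$$ \mathcal{E}(M_{\mathbf{z}}, b_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}) \le \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) \big] $$
$$ \le \frac{4 R_n}{\sqrt{\lambda}} + \frac{4\big(3 + 2 X_*/\sqrt{\lambda}\big)}{\sqrt{n}} + 2\big(1 + X_*/\sqrt{\lambda}\big) \Big( \frac{2 \ln(1/\delta)}{n} \Big)^{1/2}. \tag{13} $$
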

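Proof The proof of the theorem can be divided into three steps as follows.

Step 1: Let $\mathbb{E}_{\mathbf{z}}$ denote the expectation with respect to samples $\mathbf{z}$. Observe that
$\mathcal{E}(M_{\mathbf{z}}, b_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}) \le \sup_{(M,b) \in \mathcal{F}} [ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) ]$. For any
$\mathbf{z} = (z_1, \ldots, z_{k-1}, z_k, z_{k+1}, \ldots, z_n)$ and $\mathbf{z}' = (z_1, \ldots, z_{k-1}, z'_k, z_{k+1}, \ldots, z_n)$ we know from inequality (11) that

$$ \Big| \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) \big] - \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}'}(M, b) \big] \Big| \le \sup_{(M,b) \in \mathcal{F}} \big| \mathcal{E}_{\mathbf{z}}(M, b) - \mathcal{E}_{\mathbf{z}'}(M, b) \big| $$
$$ = \frac{1}{n(n-1)} \sup_{(M,b) \in \mathcal{F}} \sum_{j \in \mathbb{N}_n,\, j \ne k} \big| \Phi_{M,b}(z_k, z_j) - \Phi_{M,b}(z'_k, z_j) \big| $$
$$ \le \frac{1}{n(n-1)} \sup_{(M,b) \in \mathcal{F}} \sum_{j \in \mathbb{N}_n,\, j \ne k} \big( |\Phi_{M,b}(z_k, z_j)| + |\Phi_{M,b}(z'_k, z_j)| \big) $$
$$ \le 4\big(1 + X_*/\sqrt{\lambda}\big)/n. $$

Applying McDiarmid's inequality (McDiarmid, 1989) (see Lemma 5 in the Appendix) to
the term $\sup_{(M,b) \in \mathcal{F}} [ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) ]$, with probability $1 - \delta$ there holds

$$ \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) \big] \le \mathbb{E}_{\mathbf{z}} \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) \big] + 2\big(1 + X_*/\sqrt{\lambda}\big) \Big( \frac{2 \ln(1/\delta)}{n} \Big)^{1/2}. \tag{14} $$

Now we only need to estimate the first term, in expectation form, on the right-hand
side of the above inequality by symmetrization techniques.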

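Step 2: To estimate $\mathbb{E}_{\mathbf{z}} \sup_{(M,b) \in \mathcal{F}} [ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) ]$, applying Lemma 6 with
$q_{(M,b)}(z_i, z_j) = \mathcal{E}(M, b) - (1 + r(y_i, y_j)(d_M(x_i, x_j) - b))_+$ implies that

$$ \mathbb{E}_{\mathbf{z}} \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) \big] \le \mathbb{E}_{\mathbf{z}} \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \bar{\mathcal{E}}_{\mathbf{z}}(M, b) \big], \tag{15} $$

where $\bar{\mathcal{E}}_{\mathbf{z}}(M, b) = \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} \Phi_{M,b}(z_i, z_{\lfloor n/2 \rfloor + i})$. Now let $\bar{\mathbf{z}} = \{\bar z_1, \bar z_2, \ldots, \bar z_n\}$ be i.i.d. samples
which are independent of $\mathbf{z}$; then

$$ \mathbb{E}_{\mathbf{z}} \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \bar{\mathcal{E}}_{\mathbf{z}}(M, b) \big] = \mathbb{E}_{\mathbf{z}} \sup_{(M,b) \in \mathcal{F}} \big[ \mathbb{E}_{\bar{\mathbf{z}}}[\bar{\mathcal{E}}_{\bar{\mathbf{z}}}(M, b)] - \bar{\mathcal{E}}_{\mathbf{z}}(M, b) \big] \le \mathbb{E}_{\mathbf{z}, \bar{\mathbf{z}}} \sup_{(M,b) \in \mathcal{F}} \big[ \bar{\mathcal{E}}_{\bar{\mathbf{z}}}(M, b) - \bar{\mathcal{E}}_{\mathbf{z}}(M, b) \big]. \tag{16} $$

By standard symmetrization techniques (see e.g. Bartlett and Mendelson (2002)), for i.i.d.
Rademacher variables $\{\sigma_i \in \{\pm 1\} : i \in \mathbb{N}_{\lfloor n/2 \rfloor}\}$, we have that

$$ \mathbb{E}_{\mathbf{z}, \bar{\mathbf{z}}} \sup_{(M,b) \in \mathcal{F}} \big[ \bar{\mathcal{E}}_{\bar{\mathbf{z}}}(M, b) - \bar{\mathcal{E}}_{\mathbf{z}}(M, b) \big] = \mathbb{E}_{\mathbf{z}, \bar{\mathbf{z}}, \sigma} \frac{1}{\lfloor n/2 \rfloor} \sup_{(M,b) \in \mathcal{F}} \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \big[ \Phi_{M,b}(\bar z_i, \bar z_{\lfloor n/2 \rfloor + i}) - \Phi_{M,b}(z_i, z_{\lfloor n/2 \rfloor + i}) \big] $$
$$ \le 2\, \mathbb{E}_{\mathbf{z}, \sigma} \frac{1}{\lfloor n/2 \rfloor} \sup_{(M,b) \in \mathcal{F}} \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Phi_{M,b}(z_i, z_{\lfloor n/2 \rfloor + i}) \le 2\, \mathbb{E}_{\mathbf{z}, \sigma} \frac{1}{\lfloor n/2 \rfloor} \sup_{(M,b) \in \mathcal{F}} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Phi_{M,b}(z_i, z_{\lfloor n/2 \rfloor + i}) \Big|. \tag{17} $$

Applying the contraction property of Rademacher averages (see Lemma 7 in the Appendix)
with $\Psi_i(t) = (1 + r(y_i, y_{\lfloor n/2 \rfloor + i})\, t)_+ - 1$, we have the following estimation for the last term
on the right-hand side of the above inequality:

$$ \mathbb{E}_\sigma \frac{1}{\lfloor n/2 \rfloor} \sup_{(M,b) \in \mathcal{F}} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Phi_{M,b}(z_i, z_{\lfloor n/2 \rfloor + i}) \Big| $$
$$ \le \mathbb{E}_\sigma \frac{1}{\lfloor n/2 \rfloor} \sup_{(M,b) \in \mathcal{F}} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \big( \Phi_{M,b}(z_i, z_{\lfloor n/2 \rfloor + i}) - 1 \big) \Big| + \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_\sigma \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Big| $$
$$ \le \frac{2}{\lfloor n/2 \rfloor} \mathbb{E}_\sigma \sup_{(M,b) \in \mathcal{F}} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \big( d_M(x_i, x_{\lfloor n/2 \rfloor + i}) - b \big) \Big| + \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_\sigma \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Big| \tag{18} $$
$$ \le \frac{2}{\lfloor n/2 \rfloor} \mathbb{E}_\sigma \sup_{\|M\| \le 1/\sqrt{\lambda}} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i\, d_M(x_i, x_{\lfloor n/2 \rfloor + i}) \Big| + \frac{3 + 2 X_*/\sqrt{\lambda}}{\lfloor n/2 \rfloor} \mathbb{E}_\sigma \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Big|. $$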

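Step 3: It remains to estimate the terms on the right-hand side of inequality (18). To
this end, observe that

$$ \mathbb{E}_\sigma \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Big| \le \Big( \mathbb{E}_\sigma \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \Big|^2 \Big)^{1/2} = \sqrt{\lfloor n/2 \rfloor}. $$

Moreover,

$$ \mathbb{E}_\sigma \sup_{\|M\| \le 1/\sqrt{\lambda}} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i\, d_M(x_i, x_{\lfloor n/2 \rfloor + i}) \Big| = \mathbb{E}_\sigma \sup_{\|M\| \le 1/\sqrt{\lambda}} \Big| \Big\langle \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i - x_{\lfloor n/2 \rfloor + i})(x_i - x_{\lfloor n/2 \rfloor + i})^\top, M \Big\rangle \Big| $$
$$ \le \frac{1}{\sqrt{\lambda}}\, \mathbb{E}_\sigma \Big\| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i X_{i(\lfloor n/2 \rfloor + i)} \Big\|_* . $$

Putting the above estimations and inequalities (17), (18) together yields that

$$ \mathbb{E}_{\mathbf{z}, \bar{\mathbf{z}}} \sup_{(M,b) \in \mathcal{F}} \big[ \bar{\mathcal{E}}_{\bar{\mathbf{z}}}(M, b) - \bar{\mathcal{E}}_{\mathbf{z}}(M, b) \big] \le \frac{2\big(3 + 2 X_*/\sqrt{\lambda}\big)}{\sqrt{\lfloor n/2 \rfloor}} + \frac{4 R_n}{\sqrt{\lambda}} \le \frac{4\big(3 + 2 X_*/\sqrt{\lambda}\big)}{\sqrt{n}} + \frac{4 R_n}{\sqrt{\lambda}}. $$

Consequently, combining this with inequalities (15), (16) implies that

$$ \mathbb{E}_{\mathbf{z}} \sup_{(M,b) \in \mathcal{F}} \big[ \mathcal{E}(M, b) - \mathcal{E}_{\mathbf{z}}(M, b) \big] \le \frac{4\big(3 + 2 X_*/\sqrt{\lambda}\big)}{\sqrt{n}} + \frac{4 R_n}{\sqrt{\lambda}}. $$

Putting this estimation together with (14) completes the proof of the theorem. ■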



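In the setting of similarity learning, $X_*$ and $R_n$ are replaced by

$$ \widetilde{X}_* = \sup_{x, t \in \mathcal{X}} \| x t^\top \|_* \quad \text{and} \quad \widetilde{R}_n = \frac{1}{\lfloor n/2 \rfloor}\, \mathbb{E}_{\mathbf{z}} \mathbb{E}_\sigma \Big\| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i \widetilde{X}_{i(\lfloor n/2 \rfloor + i)} \Big\|_* , \tag{19} $$

where $\widetilde{X}_{i(\lfloor n/2 \rfloor + i)} = x_i x_{\lfloor n/2 \rfloor + i}^\top$. Let $\widetilde{\mathcal{F}} = \{ (M, b) : \|M\| \le 1/\sqrt{\lambda},\; |b| \le 1 + \widetilde{X}_* \|M\| \}$. Using
exactly the same argument as above, we can prove the following bound for the similarity learning
formulation (4).

Theorem 3 Let $(\widetilde{M}_{\mathbf{z}}, \widetilde{b}_{\mathbf{z}})$ be the solution of formulation (4). Then, for any $0 < \delta < 1$, with
probability $1 - \delta$ we have that

$$ \widetilde{\mathcal{E}}(\widetilde{M}_{\mathbf{z}}, \widetilde{b}_{\mathbf{z}}) - \widetilde{\mathcal{E}}_{\mathbf{z}}(\widetilde{M}_{\mathbf{z}}, \widetilde{b}_{\mathbf{z}}) \le \sup_{(M,b) \in \widetilde{\mathcal{F}}} \big[ \widetilde{\mathcal{E}}(M, b) - \widetilde{\mathcal{E}}_{\mathbf{z}}(M, b) \big] $$
$$ \le \frac{4 \widetilde{R}_n}{\sqrt{\lambda}} + \frac{4\big(3 + 2 \widetilde{X}_*/\sqrt{\lambda}\big)}{\sqrt{n}} + 2\big(1 + \widetilde{X}_*/\sqrt{\lambda}\big) \Big( \frac{2 \ln(1/\delta)}{n} \Big)^{1/2}. \tag{20} $$
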

4. Estimation of $R_n$ and Discussion

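From Theorem 2, we need to estimate the Rademacher average for metric learning, i.e. $R_n$,
and the quantity $X_*$, for different matrix regularisation terms. Without loss of generality,
we only focus on popular matrix norms such as the Frobenius norm (Jin et al., 2009),
$L^1$-norm (Rosales and Fung, 2006), trace-norm (Ying et al., 2009; Shen et al., 2009) and mixed
$(2,1)$-norm (Ying et al., 2009).

Example 1 (Frobenius norm) Let the matrix norm be the Frobenius norm, i.e. $\|M\| = \|M\|_F$;
then the quantity $X_* = \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F^2$ and the Rademacher complexity is
estimated as follows:

$$ R_n \le \frac{2 X_*}{\sqrt{n}} = \frac{2 \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F^2}{\sqrt{n}}. $$

Let $(M_{\mathbf{z}}, b_{\mathbf{z}})$ be a solution of formulation (2) with Frobenius-norm regularisation. For any
$0 < \delta < 1$, with probability $1 - \delta$ there holds

$$ \mathcal{E}(M_{\mathbf{z}}, b_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}) \le 2\Big(1 + \frac{\sup_{x, x' \in \mathcal{X}} \|x - x'\|_F^2}{\sqrt{\lambda}}\Big) \Big( \frac{2 \ln(1/\delta)}{n} \Big)^{1/2} + \frac{16 \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F^2}{\sqrt{n \lambda}} + \frac{12}{\sqrt{n}}. \tag{21} $$

Proof Note that the dual norm of the Frobenius norm is itself. The estimation of $X_*$ is
straightforward. The Rademacher complexity $R_n$ is estimated as follows:

$$ R_n = \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \mathbb{E}_\sigma \Big\| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i X_{i(\lfloor n/2 \rfloor + i)} \Big\|_F \le \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \Big( \sum_{i=1}^{\lfloor n/2 \rfloor} \| X_{i(\lfloor n/2 \rfloor + i)} \|_F^2 \Big)^{1/2} \le \frac{X_*}{\sqrt{\lfloor n/2 \rfloor}} \le \frac{2 X_*}{\sqrt{n}}. $$

Putting this estimation back into equation (13) completes the proof of Example 1. ■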

Other popular matrix norms for metric learning are the $L^1$-norm, trace-norm and mixed
$(2,1)$-norm. The dual norms are respectively the $L^\infty$-norm, the spectral norm (i.e. the maximum
of the singular values) and the mixed $(2,\infty)$-norm. All the dual norms mentioned above are no
larger than the Frobenius norm. Hence, the following estimation always holds true for all the
norms mentioned above:

$$ X_* \le \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F^2, \quad \text{and} \quad R_n \le \frac{2 \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F^2}{\sqrt{n}}. $$

Consequently, the generalization bound (21) holds true for the metric learning formulation (2)
with $L^1$-norm, trace-norm or mixed $(2,1)$-norm regularisation. However, in some cases,
the above upper bounds are too conservative. For instance, in the following examples we
show that a more refined estimation of $R_n$ can be obtained by applying the Khinchin
inequalities for Rademacher averages (Peña and Giné, 1999).

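Example 2 (Sparse $L^1$-norm) Let the matrix norm be the $L^1$-norm, i.e. $\|M\| = \sum_{\ell, k \in \mathbb{N}_d} |M_{\ell k}|$.
Then, $X_* = \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2$ and

$$ R_n \le 4 \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2 \sqrt{\frac{e \log d}{n}}. $$

Let $(M_{\mathbf{z}}, b_{\mathbf{z}})$ be a solution of formulation (2) with $L^1$-norm regularisation. For any $0 < \delta < 1$,
with probability $1 - \delta$ there holds

$$ \mathcal{E}(M_{\mathbf{z}}, b_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}) \le 2\Big(1 + \frac{\sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2}{\sqrt{\lambda}}\Big) \Big( \frac{2 \ln(1/\delta)}{n} \Big)^{1/2} + \frac{8 \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2 \big(1 + 2\sqrt{e \log d}\big)}{\sqrt{n \lambda}} + \frac{12}{\sqrt{n}}. \tag{22} $$

Proof The dual norm of the $L^1$-norm is the $L^\infty$-norm. Hence, $X_* = \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2$.
To estimate $R_n$, we observe, for any $1 < q < \infty$, that

$$ R_n = \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \mathbb{E}_\sigma \Big\| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i X_{i(\lfloor n/2 \rfloor + i)} \Big\|_\infty \le \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \mathbb{E}_\sigma \Big( \sum_{\ell, k \in \mathbb{N}_d} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^q \Big)^{1/q} $$
$$ \le \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \Big( \sum_{\ell, k \in \mathbb{N}_d} \mathbb{E}_\sigma \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^q \Big)^{1/q}, \tag{23} $$

where $x_i^k$ represents the $k$-th coordinate element of the vector $x_i \in \mathbb{R}^d$. To estimate the term
on the right-hand side of inequality (23), we apply the Khinchin-Kahane inequality (see
Lemma 8 in the Appendix) with $p = 2 < q < \infty$, which yields that

$$ \mathbb{E}_\sigma \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^q \le q^{q/2} \Big( \sum_{i=1}^{\lfloor n/2 \rfloor} (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)^2 (x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell)^2 \Big)^{q/2} \le \max_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^{2q}\, \big(\lfloor n/2 \rfloor\big)^{q/2} q^{q/2}. \tag{24} $$

Putting the above estimation back into (23) and letting $q = 4 \log d$ implies that

$$ R_n \le \max_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2\, d^{2/q} \sqrt{q} \,\big/ \sqrt{\lfloor n/2 \rfloor} = 2 \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2 \sqrt{e \log d / \lfloor n/2 \rfloor} \le 4 \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2 \sqrt{e \log d / n}. $$

Putting the estimation for $X_*$ and $R_n$ into Theorem 2 yields inequality (22). This
completes the proof of Example 2. ■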



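Example 3 (Mixed $(2,1)$-norm) Consider $\|M\| = \sum_{\ell \in \mathbb{N}_d} \big( \sum_{k \in \mathbb{N}_d} |M_{\ell k}|^2 \big)^{1/2}$. Then, we have
$X_* = \big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F \big] \big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty \big]$, and

$$ R_n \le 4 \Big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty \Big] \Big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F \Big] \sqrt{\frac{e \log d}{n}}. $$

Let $(M_{\mathbf{z}}, b_{\mathbf{z}})$ be a solution of formulation (2) with the mixed $(2,1)$-norm. For any $0 < \delta < 1$,
with probability $1 - \delta$ there holds

$$ \mathcal{E}(M_{\mathbf{z}}, b_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(M_{\mathbf{z}}, b_{\mathbf{z}}) \le 2\Big(1 + \frac{X_*}{\sqrt{\lambda}}\Big) \Big( \frac{2 \ln(1/\delta)}{n} \Big)^{1/2} + \frac{8 X_* \big(1 + 2\sqrt{e \log d}\big)}{\sqrt{n \lambda}} + \frac{12}{\sqrt{n}}. \tag{25} $$

Proof The estimation of $X_*$ is straightforward and we estimate $R_n$ as follows. There holds

$$ R_n = \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \mathbb{E}_\sigma \Big\| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i X_{i(\lfloor n/2 \rfloor + i)} \Big\|_{(2,\infty)} = \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \mathbb{E}_\sigma \sup_{\ell \in \mathbb{N}_d} \Big( \sum_{k \in \mathbb{N}_d} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^2 \Big)^{1/2} $$
$$ \le \frac{1}{\lfloor n/2 \rfloor} \mathbb{E}_{\mathbf{z}} \Big( \mathbb{E}_\sigma \sup_{\ell \in \mathbb{N}_d} \sum_{k \in \mathbb{N}_d} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^2 \Big)^{1/2}. \tag{26} $$

It remains to estimate the terms inside the parenthesis on the right-hand side of the above
inequality. To this end, we observe, for any $q' > 1$, that

$$ \mathbb{E}_\sigma \sup_{\ell \in \mathbb{N}_d} \sum_{k \in \mathbb{N}_d} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^2 \le \Big( \sum_{\ell \in \mathbb{N}_d} \mathbb{E}_\sigma \Big( \sum_{k \in \mathbb{N}_d} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^2 \Big)^{q'} \Big)^{1/q'}. $$

Applying the Khinchin-Kahane inequality (Lemma 8 in the Appendix) with $q = 2q' = 4 \log d$
and $p = 2$ to the above inequality yields that

$$ \mathbb{E}_\sigma \sup_{\ell \in \mathbb{N}_d} \sum_{k \in \mathbb{N}_d} \Big| \sum_{i=1}^{\lfloor n/2 \rfloor} \sigma_i (x_i^k - x_{\lfloor n/2 \rfloor + i}^k)(x_i^\ell - x_{\lfloor n/2 \rfloor + i}^\ell) \Big|^2 \le 2q' \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2\, d^{1/q'} \sum_{i=1}^{\lfloor n/2 \rfloor} \|x_i - x_{\lfloor n/2 \rfloor + i}\|_F^2 $$
$$ \le 4e (\log d) \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty^2 \sum_{i=1}^{\lfloor n/2 \rfloor} \|x_i - x_{\lfloor n/2 \rfloor + i}\|_F^2 . $$

Putting the above estimation back into (26) implies that

$$ R_n \le \sqrt{4e \log d}\, \Big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty \Big] \mathbb{E}_{\mathbf{z}} \Big( \sum_{i=1}^{\lfloor n/2 \rfloor} \|x_i - x_{\lfloor n/2 \rfloor + i}\|_F^2 \Big)^{1/2} \Big/ \lfloor n/2 \rfloor $$
$$ \le \sqrt{4e \log d}\, \Big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty \Big] \Big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F \Big] \Big/ \sqrt{\lfloor n/2 \rfloor} $$
$$ \le 4 \sqrt{e \log d}\, \Big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty \Big] \Big[ \sup_{x, x' \in \mathcal{X}} \|x - x'\|_F \Big] \Big/ \sqrt{n}. $$

Combining this with Theorem 2 implies the inequality (25). This completes the proof of
the example. ■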



In the Frobenius-norm case, the main term of the bound (21) is
$O\Big( \frac{\sup_{x, x' \in \mathcal{X}} \|x - x'\|_F^2}{\sqrt{n \lambda}} \Big)$.
This bound is consistent with that given by Jin et al. (2009), where $\sup_{x \in \mathcal{X}} \|x\|_F$ is assumed
to be bounded by some constant $B$. Comparing the generalization bounds in the above
examples, the key terms $X_*$ and $R_n$ mainly differ in two quantities, i.e. $\sup_{x, x' \in \mathcal{X}} \|x - x'\|_F$
and $\sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty$. We argue that $\sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty$ can be much less than
$\sup_{x, x' \in \mathcal{X}} \|x - x'\|_F$. For instance, consider the input space $\mathcal{X} = [0, 1]^d$. It is easy to see that
$\sup_{x, x' \in \mathcal{X}} \|x - x'\|_F = \sqrt{d}$ while $\sup_{x, x' \in \mathcal{X}} \|x - x'\|_\infty = 1$. Consequently, we can summarise
the estimations as follows:

• Frobenius-norm: In this case, we have $X_* = d$, and $R_n \le \frac{2d}{\sqrt{n}}$.

• Sparse $L^1$-norm: In this setting, we can see that $X_* = 1$, and $R_n \le \frac{4\sqrt{e \log d}}{\sqrt{n}}$.

• Mixed $(2,1)$-norm: We obtain that $X_* = \sqrt{d}$, and $R_n \le \frac{4\sqrt{e\, d \log d}}{\sqrt{n}}$.

Therefore, when $d$ is large, the generalization bound with sparse $L^1$-norm regularisation
is much better than that with Frobenius-norm regularisation, while the bound with the mixed
$(2,1)$-norm lies between those with the Frobenius norm and the $L^1$-norm. These theoretical results
are consistent with the rationale that sparse methods are more effective in dealing
with high-dimensional data.
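To make the comparison concrete, the sketch below evaluates the three estimates of $R_n$ for $\mathcal{X} = [0,1]^d$ and plugs them into a bound of the form $4R_n/\sqrt{\lambda} + 4(3 + 2X_*/\sqrt{\lambda})/\sqrt{n} + 2(1 + X_*/\sqrt{\lambda})\sqrt{2\ln(1/\delta)/n}$, which is assumed here to be the right-hand side of Theorem 2. The function names and the particular values of $d$, $n$, $\lambda$ and $\delta$ are hypothetical and chosen only for illustration.

```python
import math

def unit_cube_estimates(d, n):
    """(X_*, upper bound on R_n) for X = [0, 1]^d under each regulariser."""
    return {
        "frobenius": (float(d), 2.0 * d / math.sqrt(n)),
        "l1": (1.0, 4.0 * math.sqrt(math.e * math.log(d) / n)),
        "mixed_21": (math.sqrt(d), 4.0 * math.sqrt(math.e * d * math.log(d) / n)),
    }

def generalization_bound(R_n, X_star, lam, n, delta):
    """Assumed form of the Theorem 2 bound on the gap between true and empirical error."""
    scale = X_star / math.sqrt(lam)
    return (4.0 * R_n / math.sqrt(lam)
            + 4.0 * (3.0 + 2.0 * scale) / math.sqrt(n)
            + 2.0 * (1.0 + scale) * math.sqrt(2.0 * math.log(1.0 / delta) / n))

# Illustrative setting (assumption): d = 1000, n = 10000, lambda = 1, delta = 0.05.
estimates = unit_cube_estimates(d=1000, n=10000)
bounds = {name: generalization_bound(R, X, lam=1.0, n=10000, delta=0.05)
          for name, (X, R) in estimates.items()}
```

The ordering $L^1$ < mixed $(2,1)$ < Frobenius only emerges once $d$ is reasonably large; for very small $d$ the constant $4\sqrt{e \log d}$ can exceed $2d$, so the sketch should be read as an illustration of the high-dimensional regime.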

We end this section with two remarks. Firstly, in the setting of trace-norm regularisation,
it remains an open question for us how to establish a more accurate estimation of $R_n$ by using
the Khinchin-Kahane inequality. Secondly, the bounds in the above examples are also true for
similarity learning with different matrix-norm regularisation. Indeed, the generalization
bound for similarity learning in Theorem 3 tells us that it suffices to estimate $\widetilde{X}_*$ and $\widetilde{R}_n$.
In analogy to the arguments in the above examples, we can get the following results. For
the similarity learning formulation (4) with Frobenius-norm regularisation, there holds

$$ \widetilde{X}_* = \sup_{x \in \mathcal{X}} \|x\|_F^2, \qquad \widetilde{R}_n \le \frac{2 \sup_{x \in \mathcal{X}} \|x\|_F^2}{\sqrt{n}}. $$

For $L^1$-norm regularisation, we have

$$ \widetilde{X}_* = \sup_{x \in \mathcal{X}} \|x\|_\infty^2, \qquad \widetilde{R}_n \le 4 \sup_{x \in \mathcal{X}} \|x\|_\infty^2 \sqrt{e \log d} \big/ \sqrt{n}. $$

In the setting of the $(2,1)$-norm, we obtain

$$ \widetilde{X}_* = \sup_{x \in \mathcal{X}} \|x\|_\infty \sup_{x \in \mathcal{X}} \|x\|_F, \qquad \widetilde{R}_n \le 4 \Big[ \sup_{x \in \mathcal{X}} \|x\|_F \sup_{x \in \mathcal{X}} \|x\|_\infty \Big] \sqrt{e \log d} \big/ \sqrt{n}. $$

Putting these estimations back into Theorem 3 yields generalization bounds for similarity
learning with different matrix norms. For simplicity, we omit the details here.



5. Conclusion 

In this paper we are mainly concerned with the theoretical generalization analysis of
regularized metric and similarity learning. In particular, we first showed that the generalization
analysis for metric/similarity learning reduces to the estimation of the Rademacher
average over "sums-of-i.i.d." sample-blocks. Then, we derived their generalization bounds with
different matrix regularisation terms. Our analysis indicates that sparse metric/similarity
learning with $L^1$-norm regularisation could lead to significantly better bounds than those with
Frobenius-norm regularisation, especially when the dimension of the input data is high.
Our novel generalization analysis develops the techniques of U-statistics (Peña and Giné,
1999; Clémençon et al., 2008) and Rademacher complexity analysis (Bartlett and Mendelson,
2002; Koltchinskii and Panchenko, 2002). In future work we plan to improve the
generalization bounds for metric and similarity learning with trace-norm regularisation. The
target of supervised metric learning is to improve the generalization performance of kNN
classifiers. It would be very interesting to investigate how the generalization performance
of kNN classifiers relates to the generalization bounds of metric learning given here.
Appendix

In this appendix we assemble some facts which were used to establish generalization bounds
for metric/similarity learning.

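Definition 4 We say the function $f : \prod_{k=1}^n \Omega_k \to \mathbb{R}$ has bounded differences $\{c_k\}_{k=1}^n$ if, for
all $1 \le k \le n$,

$$ \max_{z_1, \ldots, z_k, z'_k, \ldots, z_n} \big| f(z_1, \ldots, z_{k-1}, z_k, z_{k+1}, \ldots, z_n) - f(z_1, \ldots, z_{k-1}, z'_k, z_{k+1}, \ldots, z_n) \big| \le c_k. $$

Lemma 5 (McDiarmid's inequality (McDiarmid, 1989)) Suppose $f : \prod_{k=1}^n \Omega_k \to \mathbb{R}$ has
bounded differences $\{c_k\}_{k=1}^n$. Then, for all $\epsilon > 0$, there holds

$$ \Pr_{\mathbf{z}} \big\{ f(\mathbf{z}) - \mathbb{E}_{\mathbf{z}} f(\mathbf{z}) \ge \epsilon \big\} \le \exp\Big( - \frac{2 \epsilon^2}{\sum_{k=1}^n c_k^2} \Big). $$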
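As a worked instance of how the lemma is used in Step 1 of the proof of Theorem 2, the confidence term can be recovered by solving for $\epsilon$. The computation below assumes McDiarmid's inequality in the form $\Pr\{ f - \mathbb{E} f \ge \epsilon \} \le \exp(-2\epsilon^2 / \sum_k c_k^2)$ and the bounded-difference constants $c_k = 4(1 + X_*/\sqrt{\lambda})/n$ obtained in Step 1.

```latex
\begin{align*}
\sum_{k=1}^n c_k^2 &= n \cdot \frac{16\,(1 + X_*/\sqrt{\lambda})^2}{n^2}
                    = \frac{16\,(1 + X_*/\sqrt{\lambda})^2}{n}, \\
\exp\Big(-\frac{2\epsilon^2}{\sum_k c_k^2}\Big) = \delta
  \;&\Longleftrightarrow\;
  \epsilon^2 = \frac{8\,(1 + X_*/\sqrt{\lambda})^2 \ln(1/\delta)}{n}, \\
\epsilon &= 2\,\big(1 + X_*/\sqrt{\lambda}\big) \Big(\frac{2 \ln(1/\delta)}{n}\Big)^{1/2}.
\end{align*}
```

This is exactly the deviation term that appears alongside the expectation in the high-probability bound of Step 1.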



Finally we list a useful property of U-statistics. Given the i.i.d. random variables
$z_1, z_2, \ldots, z_n \in Z$, let $q : Z \times Z \to \mathbb{R}$ be a symmetric real-valued function. Denote a
U-statistic of order two by $U_n = \frac{1}{n(n-1)} \sum_{i \ne j} q(z_i, z_j)$. Then, the U-statistic $U_n$ can be expressed
as

$$ U_n = \frac{1}{n!} \sum_{\pi} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q\big( z_{\pi(i)}, z_{\pi(\lfloor n/2 \rfloor + i)} \big), \tag{27} $$

where the sum is taken over all permutations $\pi$ of $\{1, 2, \ldots, n\}$. The main idea underlying
this representation is to reduce the analysis to the ordinary case of i.i.d. random variable
blocks. Based on the above representation, we can prove the following lemma, which plays
a critical role in deriving generalization bounds for metric learning. For completeness, we
include a proof here. For more details on U-statistics, one is referred to Clémençon et al.
(2008); Peña and Giné (1999).
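The representation of a U-statistic as an average over permutations of block sums can be checked by brute force for a small sample. The sketch below assumes the representation has the form $U_n = \frac{1}{n!} \sum_\pi \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q(z_{\pi(i)}, z_{\pi(\lfloor n/2 \rfloor + i)})$; the kernel `q_demo` and the sample values are hypothetical.

```python
import itertools
import math

def u_statistic(z, q):
    """Order-two U-statistic: average of q(z_i, z_j) over ordered pairs i != j."""
    n = len(z)
    total = sum(q(z[i], z[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def block_representation(z, q):
    """Average over all permutations of the block sum (1/m) sum_i q(z_pi(i), z_pi(m+i))."""
    n = len(z)
    m = n // 2
    total = 0.0
    for perm in itertools.permutations(range(n)):
        total += sum(q(z[perm[i]], z[perm[m + i]]) for i in range(m)) / m
    return total / math.factorial(n)

# Illustrative sample and symmetric kernel (assumption).
z_demo = [0.3, -1.2, 2.0, 0.7, 1.5]
q_demo = lambda a, b: (a - b) ** 2
direct = u_statistic(z_demo, q_demo)
via_blocks = block_representation(z_demo, q_demo)
```

Enumerating all $n!$ permutations is only feasible for very small $n$; the point of the check is that each block sum involves disjoint index pairs, which is what allows the i.i.d. machinery to be applied.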

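Lemma 6 Let $q_\tau : Z \times Z \to \mathbb{R}$ be real-valued functions indexed by $\tau \in T$, where $T$ is some
index set. If $z_1, \ldots, z_n$ are i.i.d. then we have that

$$ \mathbb{E} \Big[ \sup_{\tau \in T} \frac{1}{n(n-1)} \sum_{i \ne j} q_\tau(z_i, z_j) \Big] \le \mathbb{E} \Big[ \sup_{\tau \in T} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q_\tau(z_i, z_{\lfloor n/2 \rfloor + i}) \Big]. $$

Proof From the representation of U-statistics (27), we observe that

$$ \mathbb{E} \sup_{\tau \in T} \frac{1}{n(n-1)} \sum_{i \ne j} q_\tau(z_i, z_j) = \mathbb{E} \sup_{\tau \in T} \frac{1}{n!} \sum_{\pi} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q_\tau\big( z_{\pi(i)}, z_{\pi(\lfloor n/2 \rfloor + i)} \big) $$
$$ \le \frac{1}{n!}\, \mathbb{E} \sum_{\pi} \sup_{\tau \in T} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q_\tau\big( z_{\pi(i)}, z_{\pi(\lfloor n/2 \rfloor + i)} \big) $$
$$ = \frac{1}{n!} \sum_{\pi} \mathbb{E} \sup_{\tau \in T} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q_\tau\big( z_{\pi(i)}, z_{\pi(\lfloor n/2 \rfloor + i)} \big) $$
$$ = \mathbb{E} \sup_{\tau \in T} \frac{1}{\lfloor n/2 \rfloor} \sum_{i=1}^{\lfloor n/2 \rfloor} q_\tau(z_i, z_{\lfloor n/2 \rfloor + i}). $$

This completes the proof of the lemma. ■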
We need the following contraction property of Rademacher averages, which is
essentially implied by Theorem 4.12 in Ledoux and Talagrand (1991); see
also Bartlett and Mendelson (2002); Koltchinskii and Panchenko (2002).

Lemma 7 Let $F$ be a class of uniformly bounded real-valued functions on $(\Omega, \mu)$ and $m \in \mathbb{N}$.
If for each $i \in \{1, \ldots, m\}$, $\Psi_i : \mathbb{R} \to \mathbb{R}$ is a function with $\Psi_i(0) = 0$ having a Lipschitz
constant $c_i$, then for any $\{x_i\}_{i=1}^m$,

$$ \mathbb{E}_\sigma \Big( \sup_{f \in F} \Big| \sum_{i=1}^m \sigma_i \Psi_i(f(x_i)) \Big| \Big) \le 2\, \mathbb{E}_\sigma \Big( \sup_{f \in F} \Big| \sum_{i=1}^m c_i \sigma_i f(x_i) \Big| \Big). $$




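The last property of Rademacher averages is the Khinchin-Kahane inequality (see e.g.
Peña and Giné (1999, Theorem 1.3.1)).

Lemma 8 For $n \in \mathbb{N}$, let $\{f_i \in \mathbb{R} : i \in \mathbb{N}_n\}$, and let $\{\sigma_i : i \in \mathbb{N}_n\}$ be a family of i.i.d.
Rademacher variables. Then, for any $1 < p < q < \infty$ we have

$$ \Big( \mathbb{E}_\sigma \Big| \sum_{i \in \mathbb{N}_n} \sigma_i f_i \Big|^q \Big)^{1/q} \le \Big( \frac{q - 1}{p - 1} \Big)^{1/2} \Big( \mathbb{E}_\sigma \Big| \sum_{i \in \mathbb{N}_n} \sigma_i f_i \Big|^p \Big)^{1/p}. $$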
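The Khinchin-Kahane inequality can be verified exactly for a short coefficient vector by enumerating all sign patterns. The sketch below assumes the inequality in the form $(\mathbb{E}_\sigma |\sum_i \sigma_i f_i|^q)^{1/q} \le ((q-1)/(p-1))^{1/2} (\mathbb{E}_\sigma |\sum_i \sigma_i f_i|^p)^{1/p}$ for $1 < p < q < \infty$; the function name and the coefficients are hypothetical.

```python
import itertools
import math

def rademacher_moment(f, p):
    """(E_sigma |sum_i sigma_i f_i|^p)^(1/p), by exact enumeration of all 2^n sign patterns."""
    n = len(f)
    total = sum(abs(sum(s * v for s, v in zip(signs, f))) ** p
                for signs in itertools.product((-1, 1), repeat=n))
    return (total / 2 ** n) ** (1.0 / p)

# Illustrative coefficients (assumption) and the exponents p = 2 < q = 6.
f_demo = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75]
p_demo, q_demo = 2, 6
lhs = rademacher_moment(f_demo, q_demo)
rhs = math.sqrt((q_demo - 1) / (p_demo - 1)) * rademacher_moment(f_demo, p_demo)
second_moment = rademacher_moment(f_demo, 2)
```

For $p = 2$ the right-hand moment reduces to $(\sum_i f_i^2)^{1/2}$, which is why the estimates in Examples 2 and 3 only involve sums of squared coordinates.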